Exploratory Data Analysis

Data Preparation

The dataset (Republic of South Africa 2023) comprised a series of text files containing the State of the Nation Addresses (SONA) from 1994 through 2022. Each speech’s content was read in, omitting the initial lines. These speeches were then collated into a structured format for more convenient access and manipulation.

Next, essential metadata, including the year of the address and the name of the delivering president, was extracted. After that, URLs, HTML character codes, and newline characters were removed. Additionally, the date of each address was extracted and appropriately formatted.
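The cleaning step described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the original code: the function name clean_speech and the URL pattern are assumptions, and HTML character codes are decoded here rather than deleted, which may differ from what was actually done.

```python
import html
import re

# Assumed URL pattern; the original cleaning rules are not given in the text.
URL_RE = re.compile(r"https?://\S+|www\.\S+")

def clean_speech(raw: str) -> str:
    """Strip URLs, decode HTML character codes, and flatten newlines."""
    text = URL_RE.sub(" ", raw)
    text = html.unescape(text)  # e.g. "&amp;" becomes "&"
    text = text.replace("\r", " ").replace("\n", " ")
    return re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace

# Toy input, not taken from the dataset.
sample = "Madam Speaker,\nsee https://www.gov.za/sona for details &amp; updates."
cleaned = clean_speech(sample)
print(cleaned)
```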

To achieve the project’s objectives, each speech was split into its individual sentences. This granular breakdown made it possible to map each sentence to the president who delivered it. The finalised structured dataset comprises individual sentences paired with their respective presidents. This dataset was also saved as a CSV file for future use.
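A rough sketch of the sentence-level breakdown is shown below. The regular-expression splitter is a naive stand-in (the text does not say which sentence tokenizer was used), and the dictionary keys "president", "text" and "sentence" are a hypothetical schema chosen for illustration.

```python
import re

# Naive rule: split after ., ! or ? when followed by whitespace and a capital
# letter or opening quote. A dedicated sentence tokenizer would be more robust.
SENT_END = re.compile(r"(?<=[.!?])\s+(?=[A-Z\"'“])")

def split_sentences(text):
    """Return the non-empty sentences found in text."""
    return [s.strip() for s in SENT_END.split(text) if s.strip()]

def speeches_to_rows(speeches):
    """Pair every sentence with the president who delivered the speech."""
    rows = []
    for speech in speeches:
        for sentence in split_sentences(speech["text"]):
            rows.append({"president": speech["president"], "sentence": sentence})
    return rows

# Toy speeches with placeholder labels, not taken from the dataset.
speeches = [
    {"president": "President A", "text": "We must act together. The time is now!"},
    {"president": "President B", "text": "We will create jobs. Growth matters."},
]
rows = speeches_to_rows(speeches)
for row in rows:
    print(row)
```

The resulting list of rows could then be written out with csv.DictWriter to obtain a CSV file of sentence and president pairs.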

For model building, the data was prepared by creating a 70-15-15 train-validation-test split, with the same seed used for each method to ensure fair comparisons.
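One way to obtain a reproducible 70-15-15 split is sketched below. The seed value 42 and the function name are assumptions, and the sketch shuffles without stratifying by president; the text does not state whether the original split was stratified.

```python
import random

def train_val_test_split(rows, seed=42, train=0.70, val=0.15):
    """Shuffle with a fixed seed, then cut into train, validation and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # local RNG, so the global state is untouched
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * val)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],  # the remainder forms the test set
    )

# Toy data: 100 integer ids standing in for sentences.
rows = list(range(100))
train_set, val_set, test_set = train_val_test_split(rows, seed=42)
print(len(train_set), len(val_set), len(test_set))
```

Passing the same seed to every modelling method yields identical partitions, which is what makes the comparison between methods fair.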

Number of speeches per president

The bar plot above illustrates the total number of speeches given by each president. Mbeki and Zuma had the most speeches in the dataset, with 10 each. This means that a substantial amount of data is available for them, which could be advantageous when discerning their linguistic patterns, provided that the sentences of the two presidents do not overlap significantly. Motlanthe and de Klerk had only one speech each, which creates a class imbalance that may bias the model output later. To explore this further, the number of sentences per president is examined.

Number of sentences per president

The plot above gives a breakdown of the number of sentences spoken by each president. Zuma stands out with the most sentences, further underscoring his prominence in the dataset. Notably, while Mbeki gave three more speeches than Ramaphosa, their sentence counts are nearly the same, implying that Ramaphosa’s speeches tend to be longer. This breakdown provides a deeper understanding of each president’s contribution and reaffirms the data imbalance to be addressed in model development, especially since de Klerk and Motlanthe have fewer than 300 sentences each, while the others have well over 1500.

Average sentence length per president

This plot shows the average sentence length, in words, for each president. A striking observation is that Zuma, despite having the most sentences and the joint-most speeches, has a relatively short average sentence length. Conversely, Mbeki and Motlanthe have longer average sentence lengths, with Mbeki being the only president who averaged over 30 words per sentence. This metric offers insights into the verbosity and style of each president, which can be a useful feature when discerning speech patterns in model building.
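The average sentence length per president can be computed along the following lines. This is a sketch that assumes rows with hypothetical "president" and "sentence" keys and counts words by whitespace splitting, which may differ from the tokenization used for the plot.

```python
from collections import defaultdict

def average_sentence_length(rows):
    """Return the mean number of whitespace-separated words per sentence, by president."""
    totals = defaultdict(lambda: [0, 0])  # president -> [word total, sentence total]
    for row in rows:
        entry = totals[row["president"]]
        entry[0] += len(row["sentence"].split())
        entry[1] += 1
    return {p: words / count for p, (words, count) in totals.items()}

# Toy rows with placeholder labels, not taken from the dataset.
rows = [
    {"president": "President A", "sentence": "We will build more houses this year."},
    {"president": "President A", "sentence": "Jobs matter."},
    {"president": "President B", "sentence": "Growth must be shared by all."},
]
avg = average_sentence_length(rows)
print(avg)
```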

Word clouds for each president

The word clouds above offer a visual representation of the words most frequently used by each president. The size of each word in the cloud corresponds to its frequency in the speeches. All the presidents had “will” as their most prominent word and referred to the country many times while speaking (highlighted by the use of the words “south” and “africa”/“african”). Motlanthe seemed to focus more on the economy and public image, with words such as “national”, “public” and “government”, whereas Mandela seemed to focus more on the people, with words such as “people” and “us”. De Klerk focused more on the constitution and forming alliances during a transitional period, and Zuma focused more on work and development. These word clouds provide a snapshot of the focal points and themes of each president’s speeches. Distinctive words or terms can be potential features when building predictive models. The words from the word clouds can also be seen in the bar plots below.

Word frequency distribution for each president

N-gram frequency distributions for each president

Bigrams

Instead of only looking at single word frequency, bigrams can also be used to find the most common two-word phrases. The bigrams above elucidate the distinctive linguistic patterns and thematic foci of each president, presenting opportunities for differentiation. For instance, President Mandela’s frequently used bigrams, such as “South Africans” and “national unity,” reflect his emphasis on nation-building and reconciliation during his tenure. In contrast, President Zuma’s bigrams like “economic growth” suggest a policy-driven discourse concentrated on economic dynamics. However, there are potential pitfalls. Overlapping or common bigrams across presidents, such as generic terms or phrases prevalent in political discourse, could introduce ambiguity, potentially hindering the model’s precision. Additionally, while President Ramaphosa’s bigrams like “South Africa” are distinctly frequent, they are not uniquely attributable to him, as such phrases are likely universal across South African presidencies.

Trigrams

Expanding on the analysis of linguistic markers, trigrams offer insights into the most recurrent three-word sequences employed by each president. The trigram outputs above further refine our understanding of the unique verbal choices and thematic concerns of each leader. For instance, President Mandela’s recurrent trigrams, such as “trade union movement”, underscore his consistent focus on the working class of South Africa. Meanwhile, President Zuma’s trigrams, such as “expanded public works”, indicate a focus on the public sector as a whole. Conversely, the presence of generic or universally applicable trigrams, such as “state nation address”, might pose challenges. These broadly used trigrams, inherent to political addresses across presidencies, might dilute the distinctive features of individual presidents, complicating the model’s task. Moreover, trigrams like “south africa will” from President Ramaphosa, although salient, are common to the speeches of all presidents, making them less distinguishing. Thus, while trigrams can accentuate the nuances of each president’s discourse, the model would need to separate distinctive trigrams from generic ones.
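The bigram and trigram counts discussed above can be sketched with a small helper. The stop-word list here is a tiny illustrative assumption (trigrams such as “state nation address” suggest stop words were removed, but the actual list is not given), and the tokenizer is a simple regular expression.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; the list actually used is not stated.
STOPWORDS = {"the", "of", "and", "to", "in", "a", "is", "for", "on", "we", "our"}

def tokenize(sentence):
    """Lowercase, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z']+", sentence.lower()) if t not in STOPWORDS]

def ngram_counts(sentences, n):
    """Count n-grams within each sentence (n-grams never span two sentences)."""
    counts = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts

# Toy sentences, not taken from the dataset.
sentences = [
    "The expanded public works programme creates jobs.",
    "We support the expanded public works programme.",
]
bigrams = ngram_counts(sentences, 2)
trigrams = ngram_counts(sentences, 3)
print(bigrams.most_common(3))
print(trigrams.most_common(2))
```

Running the same helper once per president and comparing the top entries is one way to see which n-grams are shared across presidents and which are distinctive.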

Sentence similarity between presidents

Bag of Words (BOW) representation

The Bag-of-Words (BoW) visualisation above reveals a pronounced central cluster with substantial overlap across presidential sentences, indicating pervasive shared linguistic elements. This convergence towards common terms suggests that the BoW representation predominantly captures universal themes and terminologies characteristic of political discourse. Such patterns, while illuminating shared linguistic tendencies, underscore potential challenges in predictive modeling, with the BoW approach possibly lacking the granularity to detect distinctive linguistic markers for each president.
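For reference, a bag-of-words matrix can be built as in the sketch below. It uses plain Python rather than any particular library, and the function names are illustrative; the original analysis may have used a library vectorizer and a dimensionality-reduction step for plotting, neither of which is shown here.

```python
from collections import Counter

def build_vocabulary(sentences):
    """Map each distinct lowercase word to a column index (alphabetical order)."""
    vocab = sorted({w for s in sentences for w in s.lower().split()})
    return {w: i for i, w in enumerate(vocab)}

def bag_of_words(sentences, vocab):
    """Return one count vector per sentence, with columns ordered as in vocab."""
    matrix = []
    for s in sentences:
        counts = Counter(s.lower().split())
        matrix.append([counts.get(w, 0) for w in vocab])
    return matrix

# Toy sentences, not taken from the dataset.
sentences = ["jobs jobs growth", "growth and unity"]
vocab = build_vocabulary(sentences)
bow = bag_of_words(sentences, vocab)
print(vocab)
print(bow)
```

Because every sentence is reduced to raw counts over a shared vocabulary, sentences dominated by common political terms end up with similar vectors, which is consistent with the overlap described above.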

TF-IDF representation

Using the TF-IDF representation, the visualisation depicts a dominant central cluster, reaffirming the presence of overlapping linguistic constructs across presidential discourses. Unlike the BoW representation, the TF-IDF visualisation lacks discernible smaller clusters, and data points appear more dispersed. This dispersion underscores the varied thematic undertones each president might have explored, but the pronounced overlap in the central region suggests that these thematic variations are not sufficiently distinct in the TF-IDF space to provide clear demarcations. The observed patterns emphasise the challenges inherent in relying solely on TF-IDF to capture the unique linguistic nuances of each president.
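The weighting behind TF-IDF can be illustrated with the textbook formula, where a term's weight in a sentence is its relative frequency multiplied by log(N / df), with N the number of sentences and df the number of sentences containing the term. This sketch is an assumption about the general technique, not the original implementation; library vectorizers typically apply smoothing and normalisation that are omitted here.

```python
import math
from collections import Counter

def tfidf(sentences):
    """Return one {word: weight} dict per sentence using tf * log(N / df)."""
    docs = [s.lower().split() for s in sentences]
    n_docs = len(docs)
    df = Counter(w for doc in docs for w in set(doc))  # document frequency
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {w: (c / len(doc)) * math.log(n_docs / df[w]) for w, c in tf.items()}
        )
    return vectors

# Toy sentences, not taken from the dataset.
sentences = [
    "south africa will grow",
    "south africa will create jobs",
    "jobs will come",
]
weights = tfidf(sentences)
print(weights[0])
```

In this toy example "will" occurs in every sentence and receives a weight of zero, while "grow" occurs in only one and receives the largest weight in the first sentence, showing how TF-IDF downweights vocabulary shared by all speakers.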

Tokenization with Padding representation

Utilising tokenization with padding, the resultant visualization presents multiple clusters, indicating the method’s ability to recognize shared linguistic constructs or thematic groupings within the dataset. Notably, the significant intermingling of presidents within these clusters underscores the shared nature of discourse patterns across different presidencies. The absence of a dominant central cluster, a divergence from the BoW and TF-IDF representations, suggests a more nuanced and diverse sentence representation in the embedding space, potentially attributable to the emphasis on word order inherent in the tokenization method.
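Tokenization with padding turns each sentence into a fixed-length sequence of integer word ids, preserving word order. The sketch below is a plain-Python stand-in with assumed choices: ids are assigned in order of first appearance, 0 is reserved for padding, and sequences are padded and truncated at the end. Deep learning libraries offer equivalent utilities whose defaults may differ, and the text does not state which settings were used.

```python
def fit_word_index(sentences):
    """Assign each word an integer id in order of first appearance (0 = padding)."""
    index = {}
    for s in sentences:
        for w in s.lower().split():
            index.setdefault(w, len(index) + 1)
    return index

def encode(sentences, index):
    """Convert sentences to lists of word ids, skipping unknown words."""
    return [[index[w] for w in s.lower().split() if w in index] for s in sentences]

def pad_sequences(sequences, maxlen, pad_id=0):
    """Truncate or right-pad every sequence to exactly maxlen ids."""
    padded = []
    for seq in sequences:
        seq = seq[:maxlen]
        padded.append(seq + [pad_id] * (maxlen - len(seq)))
    return padded

# Toy sentences, not taken from the dataset.
sentences = ["we will build", "we will create more jobs now"]
index = fit_word_index(sentences)
padded = pad_sequences(encode(sentences, index), maxlen=5)
print(padded)
```

Unlike the BoW and TF-IDF vectors, these sequences keep the position of each word, which may help explain why the resulting plot looks different.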

References

Republic of South Africa, The Presidency of the. 2023. “State of the Nation Address.” 2023. https://www.gov.za/state-nation-address.